N-GrAM: New Groningen Author-profiling Model
Abstract
We describe our participation in the PAN 2017 shared task on Author Profiling, which involves identifying authors’ gender and language variety for English, Spanish, Arabic and Portuguese. We describe both the final, submitted system and a series of negative results. Our aim was to create a single model for both gender and language, and for all language varieties. Our best-performing system (on cross-validated results) is a linear support vector machine (SVM) with word unigrams and character 3- to 5-grams as features. A set of additional features, including POS tags, additional datasets, geographic entities, and Twitter handles, hurt rather than improved performance. Results from cross-validation indicated high performance overall, and results on the test set confirmed this, with an average accuracy of 0.86 and performance on sub-tasks ranging from 0.68 to 0.98.
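The feature set named in the abstract (word unigrams plus character 3- to 5-grams) can be illustrated with a minimal sketch. This is not the authors' code: the whitespace tokenization, the `w:` and `c<n>:` key prefixes, and the function names `word_unigrams`, `char_ngrams` and `extract_features` are all assumptions made for illustration, and the abstract does not state what preprocessing or weighting the submitted system used.

```python
from collections import Counter


def word_unigrams(text):
    """Count whitespace-separated tokens (assumed tokenization)."""
    return Counter(f"w:{tok}" for tok in text.split())


def char_ngrams(text, n_min=3, n_max=5):
    """Count all character n-grams with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[f"c{n}:{text[i:i + n]}"] += 1
    return counts


def extract_features(text):
    """Combine both feature families into one sparse count dictionary."""
    features = word_unigrams(text)
    features.update(char_ngrams(text))  # Counter.update adds counts
    return features


if __name__ == "__main__":
    # Hypothetical example tweet, not taken from the PAN 2017 corpus.
    for name, count in sorted(extract_features("good day").items()):
        print(name, count)
```

In practice, counts like these would be turned into a sparse matrix and passed to a linear SVM; with scikit-learn, for example, one could combine two `TfidfVectorizer` instances (one with `analyzer="word"`, one with `analyzer="char"` and `ngram_range=(3, 5)`) and train a `LinearSVC` on the result. Whether the original system used that particular setup is not stated in the abstract.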
Related papers
Cross-Genre Age and Gender Identification in Social Media
This paper gives a brief description of the methods adopted for the task of author profiling as part of the PAN 2016 competition [1]. Author profiling is the task of predicting the author’s age and gender from his/her writing. In this paper, we follow a two-level ensemble approach to tackle the cross-genre author profiling task where training documents and testing documents are from different g...
A Document Weighted Approach for Gender and Age Prediction Based on Term Weight Measure
Author profiling is a text classification technique, which is used to predict the profiles of unknown text by analyzing their writing styles. Author profiles are the characteristics of the authors like gender, age, nativity language, country and educational background. The existing approaches for Author Profiling suffered from problems like high dimensionality of features and fail to capture th...
Grammar Checker Features for Author Identification and Author Profiling - Notebook for PAN at CLEF 2013
Our work on author identification and author profiling is based on the question: Can the number and the types of grammatical errors serve as indicators for a specific author or a group of people? In order to detect the grammatical errors we base our approach on the output of the open-source library LanguageTool. In the case of the author identification we transform the problem into a statistica...
Topic Models and n-gram Language Models for Author Profiling - Notebook for PAN at CLEF 2015
Author profiling is the task of determining the attributes for a set of authors. This paper presents the design, approach, and results of our submission to the PAN 2015 Author Profiling Shared Task. Four corpora, each in a different language, were provided. Each corpus consisted of collections of tweets for a number of Twitter users whose gender, age and personality scores are known. The task wa...
PAN 2017: Author Profiling - Gender and Language Variety Prediction
We present the results of gender and language variety identification performed on the tweet corpus prepared for the PAN 2017 Author profiling shared task. Our approach consists of tweet preprocessing, feature construction, feature weighting and classification model construction. We propose a Logistic regression classifier, where the main features are different types of character and word n-gram...
Journal: CoRR
Volume: abs/1707.03764
Publication year: 2017